Optimal box-covering algorithm for fractal dimension of complex networks
The self-similarity of complex networks is typically investigated through computational algorithms
the primary task of which is to cover the structure with a minimal number of boxes. Here we 
introduce a box-covering algorithm that not only outperforms previous ones, but also finds optimal 
solutions. For the two benchmark cases tested, namely, the E. Coli and the WWW networks, our 
results show that the improvement can be rather substantial, reaching up to 15% in the case of the 
WWW network. 



I. INTRODUCTION 

The topological and dynamical aspects of complex networks have been the
focus of intensive research in recent years [1-15]. An open problem in
network and computer science is the following question: how can a network
be covered with the fewest possible boxes of a given size [16-21]? In a
complex network, the box size can be defined in terms of the chemical
distance $l_B$, which corresponds to the number of edges on the shortest
path between two nodes. This means that every node is less than $l_B$
edges away from any other node in the same box. Here we use the burning
approach for the box-covering problem [52], in which the boxes are defined
around a central node or edge. Instead of calculating the distance between
every pair of nodes in a box, the maximal distance $r_B$ to the central
node or edge is prescribed. This distance is related to the box size by
$r_B = (l_B - 1)/2$ for a central node and $r_B = l_B/2$ for a central
edge. The maximal chemical distance within a box of a given $r_B$ is
$2r_B$ for a central node and $2r_B - 1$ for a central edge.

Although this problem can be simply stated, its solution is known to be
NP-hard [23]. It can also be mapped to a graph coloring problem in
computer science [19] and has important applications, e.g., the
calculation of fractal dimensions of complex networks [24-29] or the
identification of the most influential spreaders in networks [30]. Here we
introduce an efficient algorithm for fractal networks that is capable of
determining the minimum number of boxes for a given parameter $l_B$ or
$r_B$. Moreover, we compare it on two benchmark networks with a standard
algorithm used to obtain the minimal number of boxes approximately.

In principle, the optimal solution could be identified by exhaustively
testing all possible solutions. For practical purposes, however, this
approach is infeasible, since the solution space with its $2^N$ solutions
is too large. Present algorithms, such as maximum-excluded-mass burning
[22] and merging algorithms [31], are based on the sequential addition of
the box with the highest score, where the score is, e.g., proportional to
the number of covered nodes. Other algorithms are based on simulated
annealing [32], but come without the guarantee of finding the optimal
solution. Even greedy algorithms end up with a similar number of boxes as
the algorithms mentioned before [20]. The greedy algorithm sequentially
adds each node to an existing box if all other nodes in this box are
within the chemical distance $l_B$; if there is no such box, a new box
containing the node is created. It is therefore believed that these
results are close to the optimal one, although the true optimal solution
is unknown.
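
As a rough illustration of the greedy strategy described above, the following Python sketch assigns the nodes one at a time to the first existing box whose members are all closer than $l_B$. The adjacency-dictionary input format and the helper names `bfs_distances` and `greedy_box_cover` are assumptions made for this example; they are not taken from Ref. [20].

```python
from collections import deque

def bfs_distances(adj, source):
    """Chemical distance (number of edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def greedy_box_cover(adj, l_B, order=None):
    """Put each node into the first box whose members are all closer than l_B."""
    order = list(adj) if order is None else list(order)
    dist = {u: bfs_distances(adj, u) for u in adj}
    boxes = []
    for u in order:
        for box in boxes:
            if all(dist[u].get(v, float("inf")) < l_B for v in box):
                box.append(u)
                break
        else:
            boxes.append([u])  # no suitable box exists: open a new one
    return boxes

# Hypothetical toy input: the path 0-1-2-3-4-5 as an adjacency dictionary.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
boxes_sorted = greedy_box_cover(path, l_B=3)
boxes_shuffled = greedy_box_cover(path, l_B=3, order=[2, 3, 0, 1, 4, 5])
```

On this toy path the natural node order gives two boxes, while the second order gives three, which shows in miniature why the greedy result depends on the node sequence.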

This paper is organized as follows. In Section II, we introduce the
algorithm and explain the main difference between the present
state-of-the-art algorithm and our optimal algorithm for a given distance
$r_B$. In Section III, results for two benchmark networks are presented
and the improvement in performance of our algorithm is shown
quantitatively. Finally, in Section IV, we present conclusions and
perspectives for future work.



II. THE ALGORITHM 

We use two slightly different algorithms for the calculation of the
optimal box covering solution, one for odd values of $l_B$ and another
for even values of $l_B$. To get the results for an odd value, the
following rules are applied:

1. Create all possible boxes: For every node $i$ create a box $B_i$
containing all nodes that are at most $r_B = (l_B - 1)/2$ edges away.
Node $i$ is called the center of the box. An example is shown in
Fig. 1(a).

2. Remove unnecessary boxes: Search for and remove all boxes $B_i$ which
are fully contained in another box $B_j$ (see Fig. 1(b)).

3. Remove unnecessary nodes: For every node $i$, check all the boxes
containing $i$: $B_{i_1}, \ldots, B_{i_n}$. If another node $j \neq i$ is
contained in all of these boxes, remove $j$ from all boxes (see
Fig. 1(c)).

4. Remove pairs of unnecessary twin boxes: Find two nodes $i, j$ which
are both in exactly two boxes of size two: $B_{i_1} = \{i, k_1\}$,
$B_{i_2} = \{i, k_2\}$ and $B_{j_1} = \{j, l_1\}$, $B_{j_2} = \{j, l_2\}$.
If $k_1 = l_1$ and $k_2 = l_2$, then $B_{i_2}$ and $B_{j_1}$ can be
removed. If $k_1 = l_2$ and $k_2 = l_1$, then $B_{i_2}$ and $B_{j_2}$ can
be removed. An example for this rule is shown in Fig. 2. Note that such
twin boxes also appear for $l_B > 2$ due to the removal of unnecessary
nodes.

5. Search for boxes that must be contained in the solution: Add to the
solution every box $B_i$ that contains a node $i$ present only in this
box. Remove all nodes $j \neq i$ covered by $B_i$ from all other boxes.

6. Iterate A: Repeat 2-5 until there is no node which 
is covered by a single box and is not part of the 
solution. 

7. System split: Check whether the remaining network can be divided into
subnetworks such that all boxes in a subnetwork contain only nodes of
this subnetwork. These subnetworks can then be processed independently of
each other.

8. System split: Find the node which is contained in the smallest number
of boxes, $N_{\mathrm{boxes}}$; each of these boxes covers a different set
of nodes $B_i$. If more than one node fulfills this criterion, choose the
node which is covered by the largest boxes. The algorithm is then divided
into $N_{\mathrm{boxes}}$ sub-algorithms, which can be calculated
independently in parallel. By removing a different set of nodes $B_i$ in
each of the $N_{\mathrm{boxes}}$ sub-algorithms, all possible solutions
are considered. An example of the splitting is shown in Fig. 3. Since we
want to identify only one optimal solution, we do not need to calculate
the results of all sub-algorithms. As soon as one of the sub-algorithms
identifies an optimal solution, we can skip the calculation of the
others. Furthermore, the calculation of a sub-algorithm can be skipped if
its minimal number of required additional boxes reaches the number of
boxes of the best solution found so far by a parallel sub-algorithm.

9. Iterate B: Repeat 2-8 until no nodes are uncovered. 

10. Identify the best solution: Choose the solution with the lowest
number of boxes. This solution is optimal for a given $r_B$.
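
The reduction rules listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only steps 1, 2, 3, 5 and the iteration of step 6, and it omits the twin-box rule (step 4) and the splitting of steps 7-9. The adjacency-dictionary input, the function names, and the choice to store a solution as a list of box centers are all hypothetical.

```python
from collections import deque

def node_boxes(adj, r_B):
    """Step 1: for each center, collect all nodes at most r_B edges away (BFS)."""
    boxes = {}
    for center in adj:
        dist = {center: 0}
        queue = deque([center])
        while queue:
            u = queue.popleft()
            if dist[u] == r_B:
                continue  # do not expand beyond the box radius
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        boxes[center] = set(dist)
    return boxes

def remove_contained_boxes(boxes):
    """Step 2: drop every box that is a subset of another (kept) box."""
    kept = {}
    # Larger boxes first, so that of two identical boxes exactly one survives.
    for center, box in sorted(boxes.items(), key=lambda kv: -len(kv[1])):
        if not any(box <= other for other in kept.values()):
            kept[center] = box
    return kept

def remove_dominated_nodes(boxes):
    """Step 3: if j lies in every box that contains i, j is covered for free."""
    nodes = set().union(*boxes.values())
    for i in sorted(nodes):
        containing = [box for box in boxes.values() if i in box]
        if not containing:
            continue  # i itself was removed earlier in this pass
        redundant = set.intersection(*containing) - {i}
        for box in boxes.values():
            box -= redundant

def take_forced_boxes(boxes, solution):
    """Step 5: a node found in a single box forces that box into the solution."""
    for center in list(boxes):
        box = boxes[center]
        others = [b for c, b in boxes.items() if c != center]
        if any(all(i not in b for b in others) for i in box):
            solution.append(center)
            del boxes[center]
            for b in boxes.values():
                b -= box  # these nodes are covered now

def reduce_cover(boxes):
    """Step 6: repeat steps 2, 3 and 5 until nothing changes."""
    boxes = {c: set(b) for c, b in boxes.items()}
    solution = []
    changed = True
    while changed and boxes:
        before = {c: frozenset(b) for c, b in boxes.items()}
        boxes = remove_contained_boxes(boxes)
        remove_dominated_nodes(boxes)
        take_forced_boxes(boxes, solution)
        boxes = {c: b for c, b in boxes.items() if b}  # drop emptied boxes
        changed = boxes != before
    return solution, boxes

# Toy input: the path 0-1-2-3-4 with l_B = 3, i.e. r_B = 1.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
solution, remaining = reduce_cover(node_boxes(path, r_B=1))
```

For this toy path the reduction alone already selects the boxes centered on nodes 1 and 3 and leaves no undecided boxes; on less regular networks `remaining` is generally non-empty and the splitting steps are needed.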

FIG. 1. The box covering algorithm on a small example network for the box
size $l_B = 3$ ($r_B = 1$ with a central node).
Upper panel: a) Step 1: Calculation of all possible boxes. The color of
each box corresponds to the node in its center. b) Step 2: All boxes that
are fully contained in another box are removed. In this example the boxes
$B_1$, $B_2$, $B_5$, and $B_7$ are removed. c) Step 3: All nodes which
are in all boxes of another node are removed. In this example, nodes 2,
3, 4 are in the same box as node 1, and nodes 4, 5, 6 are in the same box
as node 7. d) The final, optimal solution is shown on the right side.
Lower panel: The three possible solutions for the greedy box covering
algorithm, based on the largest box sizes. In this case, the boxes are
included in the solution according to the number of newly covered nodes.
Since the three boxes $B_3$, $B_4$, and $B_6$ have the same number of
nodes, the algorithm finds three different solutions: e) ($B_3$, $B_6$),
f) ($B_6$, $B_3$), and g) ($B_4$, $B_1$, $B_7$), where the last one is
not optimal.

FIG. 2. Step 4: In this example two nodes are in the same box if they are
connected by an edge. The two boxes between nodes 1 and 5 and between
nodes 2 and 3 are removed according to rule 4.

FIG. 3. Step 8: Node 4 is covered by two circles (the minimal number of
boxes) and the algorithm splits. The first sub-algorithm continues with
box $B_5$ (middle), while the second one continues with box $B_3$
(right).

To get the results for an even value of $l_B$, the first step is slightly
different:

1. Create all possible boxes: For every edge $i$ create a box $B_i$
containing all nodes that are at most $r_B = l_B/2$ nodes away. Edge $i$
is called the center of the box.

All other steps are the same as for the odd case. Note that the
calculation scales with the number of nodes $N$ of the network for odd
values and with the number of edges $M$ for even values.
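
For the even case, the box creation around a central edge might look like the following sketch. It assumes one particular reading of the rule, namely that a box with $r_B = l_B/2$ contains all nodes within $r_B - 1$ edges of either endpoint of the central edge, so that the largest distance inside a box is $2r_B - 1$ as stated in the Introduction. The function names and the requirement that node labels be orderable (used to visit each undirected edge once) are assumptions of this example.

```python
from collections import deque

def ball(adj, sources, radius):
    """Nodes within `radius` edges of any node in `sources` (multi-source BFS)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def edge_boxes(adj, r_B):
    """One box per undirected edge (u, v): both endpoints plus r_B - 1 further shells."""
    boxes = {}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge only once
                boxes[(u, v)] = ball(adj, (u, v), r_B - 1)
    return boxes

# Toy input: the path 0-1-2-3 with l_B = 4, i.e. r_B = 2.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
boxes = edge_boxes(path, r_B=2)
```

The resulting dictionary has one entry per edge, which reflects the remark that the even case scales with the number of edges $M$ rather than the number of nodes $N$.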



III. RESULTS FOR TWO BENCHMARK NETWORKS

Instead of sequentially including boxes, the idea of our algorithm is to
remove all non-optimal boxes from the solution space, ending up with a
final, optimal solution. To reduce the huge solution space, our
box-covering algorithm uses two basic ingredients: 1) Unnecessary boxes
are discarded from the solution space and the boxes which definitely
belong to the solution are kept. 2) Unnecessary nodes are discarded from
the network. These two steps significantly reduce the solution space for
a wide range of network types, especially if they are applied in
alternation, as the removal of a box can lead to the removal of nodes and
other boxes and vice versa. Nevertheless, these two steps do not
necessarily lead to the optimal solution, thus the solution space has to
be split into several possible sub-solution spaces. In each of these
sub-solution spaces the first two steps are repeated. Note that the
splitting does not reduce the number of possible solutions; only the
first two steps reduce the solution space, and in the worst case the
algorithm must explore the entire solution space. In any case, for many
complex networks iterating these three steps reduces the solution space
to a few solutions from which the optimal box covering can be obtained.
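
The splitting of the solution space can be illustrated by a plain branch-and-bound search over a set of candidate boxes. The sketch below is a simplification under stated assumptions: it branches on the node contained in the fewest boxes and prunes branches that cannot beat the best solution found so far, but it does not re-apply the reduction steps inside each branch and does not run branches in parallel, both of which the algorithm described here does. The function name `exact_cover` and the toy box set are hypothetical.

```python
def exact_cover(boxes, uncovered, chosen=(), best=None):
    """Return a smallest list of box ids covering `uncovered` (None if impossible)."""
    if not uncovered:
        return list(chosen)
    # Bound: at least one more box is needed, so this branch cannot beat `best`.
    if best is not None and len(chosen) + 1 >= len(best):
        return best
    # Split on the node that is contained in the fewest boxes.
    node = min(uncovered, key=lambda i: sum(i in b for b in boxes.values()))
    candidates = [c for c, b in boxes.items() if node in b]
    # Try the boxes that cover the most uncovered nodes first.
    candidates.sort(key=lambda c: -len(boxes[c] & uncovered))
    for c in candidates:
        best = exact_cover(boxes, uncovered - boxes[c], chosen + (c,), best)
    return best

# Hypothetical candidate boxes over the nodes 0..5.
boxes = {"a": {0, 1, 2, 3}, "b": {0, 1, 4}, "c": {2, 3, 5}, "d": {4, 5}}
best = exact_cover(boxes, set(range(6)))
```

Because every box containing the chosen node is tried, no solution is lost by the split; the pruning test only discards branches that are already at least as large as a known solution.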
The remaining question is how to judge whether a box or 
node is necessary or unnecessary. On the one hand a box 
is unnecessary if all nodes of a box are also part of an- 
other box. This box can be removed, because the other 
box covers at least the same nodes and often additional 
nodes. On the other hand a box is necessary if a node is 
exclusively covered by this single box. This box has to 
belong to the solution, since only if the box is part of the 
solution, the node is covered. 

In contrast, nodes can easily be identified as unnecessary. For example,
all nodes of a box which is part of the solution can be removed from all
other boxes, since they are already covered. Additionally, if a node
shares all its boxes with another node, the other node can be removed,
since it is always covered whenever the first node is covered. These few
rules are in principle sufficient to get the optimal solution, since our
algorithm starts with all $2^N$ or $2^M$ (for central edges) possible
solutions, discards unnecessary boxes, and includes necessary ones.

FIG. 4. Comparison of the minimal number of boxes $N(l_B)$ for a given
distance $l_B$ for the E. Coli network using the greedy graph coloring
algorithm and our optimal algorithm. While the decay for both box
covering methods is similar in the logarithmic plot, the minimal number
of boxes is different. Although the difference
$\Delta N = N_{\mathrm{greedy}} - N_{\mathrm{optimal}}$ seems to be small,
the relative improvement $\Delta N / N_{\mathrm{greedy}}$, which is shown
in the inset, is significant for small distances $l_B < 7$. Note that the
larger the box size, the more easily the network can be covered with the
optimal number of boxes. The straight line shows a power-law behavior,
where the best fit for the fractal dimension is $d_B = 3.47 \pm 0.11$ for
the greedy graph coloring and $d_B = 3.45 \pm 0.10$ for our optimal
algorithm. Within the error bars both box-covering algorithms yield the
same fractal dimension.

Although we only calculate results for undirected, unweighted networks,
the algorithm can easily be extended to directed and weighted networks.
In both cases only the initial step, the creation of boxes, is different.
For directed networks, the box around a central node contains all nodes
which are reachable with respect to the direction, while for weighted
networks, the distance is the sum of the edge weights between the nodes.
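
For the weighted and directed extension, only the box creation changes. A possible sketch uses Dijkstra's algorithm with a cutoff; the input format (a dictionary mapping each node to a list of `(neighbor, weight)` pairs for its outgoing edges), the function name `weighted_box`, and the use of a real-valued `radius` in place of $r_B$ are assumptions of this example and are not specified in the text.

```python
import heapq

def weighted_box(adj, center, radius):
    """Nodes whose weighted shortest-path distance from `center` is <= radius."""
    dist = {center: 0.0}
    heap = [(0.0, center)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was found already
        for v, w in adj[u]:  # outgoing edges only, so directions are respected
            nd = d + w
            if nd <= radius and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return set(dist)

# Hypothetical weighted toy network (each edge listed in both directions).
adj = {
    "a": [("b", 1.0), ("c", 2.5)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("a", 2.5), ("b", 1.0), ("d", 0.5)],
    "d": [("c", 0.5)],
}
box_small = weighted_box(adj, "a", radius=2.0)
box_large = weighted_box(adj, "a", radius=2.5)
```

The boxes produced this way can be passed to the same removal and splitting rules, since those rules operate on node sets only and never use distances directly.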

Next we show that our algorithm can also identify optimal solutions for
large networks. To this end, we have applied it to two different
benchmark networks, namely the E. Coli network [33], with 2859 proteins
and 6890 interactions between them, and the WWW network [2]. We compare
the results for the minimal box number $N(l_B)$ of our algorithm for
different box sizes $l_B$ with the results of the greedy graph coloring
algorithm [], as displayed in Fig. 4. While the absolute improvement is
rather small, the relative improvement reaches up to 6% for $l_B < 7$. If
the network is fractal, it should obey the relation

$$N(l_B) \sim l_B^{-d_B}, \tag{1}$$

where $d_B$ is the fractal dimension. Interestingly, the fractal
dimension of the network is nearly unaffected by the choice of the
algorithm: we find $d_B = 3.47 \pm 0.11$ with the greedy algorithm and
$d_B = 3.45 \pm 0.10$ with our optimal algorithm. Note that for
$l_B = 11$ our algorithm yields one more box, because the boxes are
calculated based on the definition of a central node or edge. The
simplest case where such a difference occurs is a closed chain of four
connected nodes (1-2, 2-3, 3-4, 4-1). All nodes are at a chemical
distance of at most two from each other ($l_B = 3$); however, it is not
possible to draw a box with radius one ($r_B = 1$) around a node that
contains all nodes.

FIG. 5. The minimal number of boxes $N(l_B)$ as a function of the
distance $l_B$ for the WWW network calculated through the greedy graph
coloring algorithm and our optimal algorithm. While the fractal dimension
for both box covering methods is nearly the same, the minimal number of
boxes is different. The difference
$\Delta N = N_{\mathrm{greedy}} - N_{\mathrm{optimal}}$ as well as the
relative improvement $\Delta N / N_{\mathrm{greedy}}$, which is shown in
the inset, are significant for $l_B < 16$. For this network a maximal
relative improvement of about 15% can be obtained.

FIG. 6. The distribution $p(N)$ of the minimal number of boxes for the
WWW network for $l_B = 5$ calculated through the greedy graph coloring
algorithm for 1500 different random node sequences. We have normalized
the results by the optimal solution obtained from our algorithm. The
distribution follows a normal distribution
$p(x) \sim \exp(-(x - \mu)^2 / (2\sigma^2))$ with $\mu = 1.07 \pm 0.01$
and $\sigma = 0.003 \pm 0.001$, thus approximately $10^{120}$
realizations are necessary to find the optimal solution with the greedy
algorithm.
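
The four-node ring used in the discussion of the E. Coli results to explain the extra box can be checked directly. The short sketch below builds the node-centered boxes with $r_B = 1$ and compares their size with the largest chemical distance in the ring; the variable names are illustrative only.

```python
# Closed chain of four nodes: 1-2, 2-3, 3-4, 4-1.
ring = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}

# Node-centered boxes with r_B = 1: each center together with its neighbours.
boxes = {c: {c, *nbrs} for c, nbrs in ring.items()}
largest_box = max(len(b) for b in boxes.values())  # 3 of the 4 nodes

# In a ring of n nodes the chemical distance is min(|i - j|, n - |i - j|).
n = len(ring)
max_distance = max(min(abs(i - j), n - abs(i - j)) for i in ring for j in ring)  # 2
```

All four nodes are therefore mutually closer than $l_B = 3$, so a single box suffices under the pairwise-distance definition, whereas no single box of radius one around a node contains all four nodes.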

The second example is the WWW network, containing 325729 nodes and
1090108 edges. As in the previous case, our algorithm outperforms the
state-of-the-art algorithm, but yields similar fractal behavior, as shown
in Fig. 5. For intermediate box sizes $l_B < 16$, we obtain a large
improvement, with up to 15% or up to 611 fewer boxes needed. For
$l_B = 16, 17, 18$ we have two more boxes, as in the E. Coli network
case, due to the two definitions of the box covering problem, while for
larger $l_B$ both algorithms give similar results. Interestingly, the
improvement for even distances $l_B$ (central edges) seems to be
significantly larger than for odd distances $l_B$ (central nodes).

In Fig. 6 we show the influence of the sequence in which nodes are added
to the boxes on the results of the greedy algorithm. While the results of
Fig. 5 are the minimal values obtained from 50 independent starting
sequences, we calculated 1500 realizations for a single box size
$l_B = 5$. The relative improvements obtained from the 50 and the 1500
realizations, 6.3% and 6.1% respectively, differ only slightly. The gap
between the optimal solution and the greedy algorithm is too large, thus
for practical purposes, the greedy algorithm will never find the optimal
solution for this box size.
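
The order of magnitude of the number of greedy realizations quoted in the caption of Fig. 6 can be reproduced with a rough estimate. The sketch assumes that the normalized greedy result is exactly Gaussian with the fitted $\mu$ and $\sigma$ and that the optimum corresponds to a normalized value of 1; the variable names are illustrative.

```python
import math

mu, sigma = 1.07, 0.003  # fitted mean and width of the normalized greedy results
z = (mu - 1.0) / sigma   # distance of the optimum from the mean in units of sigma

# Gaussian probability that a single realization reaches a normalized value <= 1.
p_opt = 0.5 * math.erfc(z / math.sqrt(2.0))

# Expected number of independent realizations needed for one such event.
realizations = 1.0 / p_opt
```

With these parameters the optimum lies more than 23 standard deviations below the mean, and the estimate comes out on the order of $10^{120}$, consistent with the value stated in the caption.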

The results for these two benchmark networks demonstrate that our
algorithm is more effective than the state-of-the-art algorithms.
Nevertheless, due to the rapid decay of the number of boxes for larger
box sizes, the fractal dimension of the two benchmark networks is only
slightly different when using the optimal box-covering algorithm in
comparison with other algorithms.
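
A fractal dimension of the kind compared here is commonly estimated from the slope of $\log N$ versus $\log l_B$, following Eq. (1). The sketch below shows such a least-squares fit; the text does not state which fitting procedure was used, so this is only one plausible choice, and the input values are synthetic numbers generated from a known exponent, not data from the benchmark networks.

```python
import math

def fit_fractal_dimension(l_values, n_values):
    """Least-squares slope of log N versus log l_B; returns d_B = -slope."""
    xs = [math.log(l) for l in l_values]
    ys = [math.log(n) for n in n_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic example only: an exact power law with exponent 2.0.
l_values = [2, 3, 4, 5, 6, 7, 8]
n_values = [1000.0 * l ** -2.0 for l in l_values]
d_B = fit_fractal_dimension(l_values, n_values)
```

Since a constant relative reduction of $N(l_B)$ shifts the curve in the log-log plot without changing its slope, such a fit is insensitive to improvements that are similar across box sizes.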



IV. CONCLUSIONS 

In closing, we have presented a box-covering algorithm which outperforms
the previously known ones. We have also compared our algorithm with the
state-of-the-art methods for different benchmark networks and detected
substantial improvements. Moreover, the obtained solutions are optimal as
a result of the algorithm design, if the box size is defined as the
maximal distance $r_B$ to the central node or edge. For example, our
approach can be useful for designing optimal commercial distribution
networks, where the shops are the nodes, the storage facilities the box
centers, and the radius is related to the boundary conditions, like
transportation cost or time.